Abstract

Purpose:  Measurement  error  and  transient  variability  affect  vital  signs.  These  issues  are  inconsistently 
considered  in  published  reports  and  clinical  practice.  We  investigated  the  association  between  major 
hemorrhagic  injury  and  vital  signs,  successively  applying  analytic  techniques  that  excluded  unreliable 
measurements,  reduced  transient  variation,  and  then  controlled  for  ambiguity  in  individual  vital  signs 
through  multivariate  analysis. 

Methods:  Vital  sign  data  from  671  adult  prehospital  trauma  patients  were  analyzed  retrospectively. 
Computer  algorithms  were  used  to  identify  and  exclude  unreliable  data  and  to  apply  time  averaging.  An 
ensemble  classifier  was  developed  and  tested  by  cross-validation.  Primary  outcome  was  hemorrhagic 
injury  plus  red  cell  transfusion.  Areas  under  receiver  operating  characteristic  curves  (ROC  AUCs)  were 
compared  by  the  test  of  DeLong  et  al. 

Results:  Of  initial  vital  signs,  systolic  blood  pressure  (BP)  had  the  highest  ROC  AUC  of  0.71  (95% 
confidence  interval,  0.64-0.78).  The  ROC  AUCs  improved  after  excluding  unreliable  data,  significantly 
for  heart  rate  and  respiratory  rate  but  not  significantly  for  BP.  Time  averaging  to  reduce  temporal 
variability  further  increased  AUCs,  significantly  for  BP  and  not  significantly  for  heart  rate  and 
respiratory  rate.  The  ensemble  classifier  yielded  a  final  ROC  AUC  of  0.84  (95%  confidence  interval, 
0.80-0.89)  in  cross-validation. 

Conclusions:  Techniques  to  reduce  variability  in  vital  sign  data  can  lead  to  significantly  improved 
diagnostic  performance.  Failure  to  consider  such  variability  could  significantly  reduce  clinical 
effectiveness  or  confound  research  investigations. 
1. Introduction

Vital  signs  measurement  is  a  routine  aspect  of  clinical 
practice  and  research  protocols.  Although  it  is  known  that 
transient  variability  and  measurement  error  can,  in  principle, 
affect  the  accuracy  of  vital  signs,  what  is  unknown  is  the 
extent  to  which  these  factors  affect  diagnostic  capabilities  in 
actual  clinical  practice.  Vital  signs  fluctuate  through  time 
because  of  transient  perturbations  (eg,  medication  boluses, 
bouts  of  pain,  anxiety,  coughing)  as  well  as  natural  steady- 
state  variability.  In  addition,  the  accuracy  of  vital  sign  data  is 
affected  by  clinicians’  technique  [1].  For  example,  accurate 
blood  pressure  (BP)  measurement  using  a  cuff  requires 
proper  fit  and  positioning  of  the  cuff,  a  relaxed  and  properly 
positioned  extremity,  and  the  absence  of  patient  motion  [2]. 
Significant  discrepancies  have  been  reported  between 
different  methods  of  measuring  noninvasive  BP  [3]. 
Similarly,  respiratory  rate  (RR)  measurement  is  prone  to 
technical  error,  whether  measured  by  a  clinician  [4]  or  by  a 
bedside  monitor  via  impedance  pneumography  (IP)  [5].  In 
one  report,  both  triage  nurses’  measurements  of  RR  and 
electronic  measurement  of  RR  revealed  poor  sensitivity  for 
bradypnea  and  tachypnea,  and  the  authors  referred  to  RR  as 
“the  vexatious  vital”  [4].  Heart  rate  (HR)  monitored  by 
electrocardiography (ECG) can be unreliable, for example, if
electrodes are improperly affixed, and false arrhythmia
alarms are commonplace [6]. Multiple authors have called
into  question  the  value  of  HR  in  assessing  the  hemodynamic 
state  of  a  patient  because  of  its  variable  relationship  with 
hypovolemia  [7,8].  Finally,  it  is  worth  noting  that  the 
accuracy  of  vital  sign  data  may  vary  considerably  for 
different  makes  and  models  of  measurement  devices  [9-11]. 

The  extent  to  which  these  factors  affect  diagnostic 
capabilities  in  actual  practice  is  relevant  to  the  design  and 
interpretation  of  clinical  investigations.  If  vital  sign  data  were 
often  polluted  by  inaccuracies,  then  there  would  be  a  bias 
toward  the  null  hypothesis,  where  positive  study  effects 
might  be  masked  (ie,  type  II  study  errors).  Alternatively, 
failure  to  describe  key  methodology  that  improved  vital  sign 
accuracy (eg, superior equipment, training, or study
protocols) would make it harder for others to replicate a
successful  study.  Consider  that  some  reports  support  the 
usefulness  of  prehospital  severity  scores  for  trauma  patients 
[12-14],  whereas  other  studies  found  those  scores  ineffective 
[15,16].  In  these  examples,  the  reports  lacked  any  explicit 
consideration of the measurement apparatus, clinical
protocols, and quality assurance processes related to vital sign
measurements;  and  inconsistency  in  how  vital  signs  were 
measured  could  have  contributed  to  the  heterogeneous 
findings.  More  broadly,  there  are  diverse  sets  of  conflicting 
reports  with  a  shared  failure  to  detail  vital  sign  measurement 
methodology, for example, the risk of volume resuscitation
of trauma patients with uncontrolled hemorrhage [17,18], the
benefit of rapid response teams for inpatients with physiologic
deterioration [19-22], and the benefit of early goal-directed
resuscitation for septic shock [23]. It is possible that
different  approaches  to  vital  sign  measurements  contributed 
to  the  inconsistencies  of  the  reports’  findings. 

We  investigated  the  association  between  standard  vital 
signs  and  major  hemorrhagic  injury  in  a  population  of 
prehospital  trauma  patients  using  computational  techniques 
that  excluded  unreliable  measurements,  reduced  transient 
perturbations,  and  reduced  ambiguity  of  individual  vital 
signs.  We  compared  these  results  with  conventional 
analyses.  The  findings  are  applicable  to  the  clinical 
evaluation  of  hemorrhage,  which  is  the  single  most  treatable 
cause  of  mortality  in  trauma  patients  [24,25].  Moreover,  the 
findings  may  relate  to  a  range  of  applications  because  the 
extent  to  which  different  analytic  methods  yield  significantly 
different  results  indicates  the  importance  of  considering 
these  factors  in  clinical  practice  and  research  studies. 


2.  Materials  and  methods 

2.1.  Clinical  data  collection 

This  was  a  retrospective  analysis  of  a  database,  originally 
collected  and  analyzed  by  Cooke  et  al  [26]  with  institutional 
review  board  approval,  of  trauma  patients  during  transport 
by  air  ambulance  from  the  scene  of  injury  to  a  level  I  trauma 
center. Between August 2001 and April 2004, the
following  physiologic  data  were  measured  in  a  convenience 
sample  by  Propaq  206EL  monitors  (Protocol  Systems, 
Beaverton,  Ore)  and  archived  using  a  networked  personal 
digital  assistant:  ECG  and  IP  recorded  at  182  and  23  Hz, 
respectively;  the  corresponding  HR  and  RR  output  at 
1-second intervals; and systolic BP (SBP) and diastolic BP
(DBP)  measured  intermittently  at  multiminute  intervals. 
Clinical  data  were  collected  during  retrospective  chart 
review,  including  demographics,  prehospital  interventions, 
hospital  treatments,  and  injury  descriptions.  Subsequently, 
vital  sign  data  from  788  patients  were  uploaded  to  our  data 
warehousing  system  [27].  Protected  health  information  was 
not  included. 

All  data  analysis  was  performed  using  MATLAB  v7 
(MathWorks,  Natick,  Mass). 

2.2.  Vital  sign  reliability 

For  each  vital  sign  value,  reliable  data  were  identified 
by  automated  algorithms  that  rated  each  datum  on  an 
integer  scale  of  0  to  3  from  least  reliable  to  most  reliable. 
Vital  sign  data  rated  2  or  3  were  considered  reliable; 
otherwise,  they  were  unreliable.  Detailed  descriptions  of 
these  algorithms  have  been  previously  reported  [28-30]. 
Here,  we  provide  an  overview  of  the  methodology.  The 
algorithms  analyze  moving  windows  of  physiologic  data. 
The  algorithms  rate  the  reliability  of  vital  signs  computed 
from the data windows based on (1) a computerized
assessment  of  the  ECG  or  IP  waveforms’  reliability  and  (2) 
a  comparison  between  the  rates  output  by  the  Propaq 
206EL  vs  an  independent  calculation  of  the  HR  or  RR 
performed  by  the  algorithm.  In  practice,  when  waveforms 
demonstrate  clear,  rhythmic  beats  or  breaths  and  the  rates 
output  by  the  Propaq  206EL  match  the  algorithms’  own 
calculations,  then  the  corresponding  HR  or  RR  is  rated  as 
reliable. Conversely, when the waveforms are noisy with
irregular, heterogeneous beats or breaths and/or there are
major discrepancies between the rates output by the Propaq
206EL vs the algorithms’ own calculations, then the HR or
RR  is  rated  as  unreliable.  The  underlying  rationale  is  the 
assumption  that  clean  ECG  or  IP  waveforms  lead  to 
reliable  HR  or  RR  measurements  and  that  HR  or  RR  tends 
to  be  reliable  when  2  independent  calculation  methods 
yield  similar  results. 
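
As a minimal sketch of how such ratings might be used downstream (not the rating algorithms themselves, which are described in [28-30]), the following Python fragment keeps only data rated 2 or 3 and computes one patient's percentage of reliable data. The `Sample` container and the function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Ratings of 2 or 3 are treated as reliable, as stated in Section 2.2.
RELIABLE_MIN_RATING = 2


@dataclass
class Sample:
    """One vital sign datum (hypothetical container for illustration)."""
    time_s: float  # seconds since the start of monitoring
    value: float   # e.g., HR in beats per minute
    rating: int    # reliability rating, integer 0 (least) to 3 (most)


def is_reliable(sample: Sample) -> bool:
    """Return True when the datum is rated 2 or 3."""
    return sample.rating >= RELIABLE_MIN_RATING


def reliable_only(samples: List[Sample]) -> List[Sample]:
    """Drop every datum rated 0 or 1."""
    return [s for s in samples if is_reliable(s)]


def percent_reliable(samples: List[Sample]) -> float:
    """Percentage of one patient's data that is rated reliable."""
    if not samples:
        raise ValueError("no samples")
    return 100.0 * len(reliable_only(samples)) / len(samples)


# Made-up HR samples, not study data.
hr = [Sample(0, 142, 0), Sample(1, 98, 2), Sample(2, 97, 3), Sample(3, 130, 1)]
print([s.value for s in reliable_only(hr)])  # [98, 97]
print(percent_reliable(hr))                  # 50.0
```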

In  prior  validation,  the  reliability  rating  of  RR  using  the 
automated  algorithms  typically  concurred  with  clinicians 
who  independently  applied  the  reliability  criteria  to  a  set  of 
test  cases  [28,30].  In  99%  of  the  test  cases,  the  automated 
algorithm  agreed  with  the  clinician  RR  rating  (±1  level), 
where  high  RR  reliability  ratings  were  found  to  be 
associated  with  smaller  differences  between  computer- 
calculated  and  human-calculated  RR  (average  differences 
of  1.7  and  8.1  breaths  per  minute  for  the  best  and  worst 
RR  reliability  ratings,  respectively).  Likewise,  there  was 
close  agreement  (within  ±5  beats  per  minute)  between 
computer-calculated  and  human-calculated  HR  in  97%  of 
the  test  cases  rated  2  or  3  by  the  automated  HR  reliability 
algorithm  [30]. 

The  BP  reliability  algorithm  determined  if  the  ratio 
between  SBP,  DBP,  and  mean  pressure  is  physiologic  and 
if  the  HR  measured  by  the  inflatable  oscillometric  cuff 
matches  the  ECG  HR  [29].  The  algorithm  does  not  attempt 
to  distinguish  between  unequal  HRs  because  of  motion 
artifact  vs  unequal  HRs  because  of  nonperfusing  electrical 
beats,  for  example,  premature  contractions;  in  the  latter  case, 
it  would  be  possible  for  reliable  BP  data  to  be  misclassified 
as  unreliable. 

2.3.  Subject  selection 

The  primary  study  population  consisted  of  patients  with 
any reliable vital sign datum within the initial 15 minutes of
prehospital monitoring. We also studied 3 subgroups:
patients with pairs of at least 1 reliable and 1 unreliable (a)
HR,  (b)  RR,  and  (c)  BP.  In  the  primary  analysis,  we  excluded 
the  “ambiguous  outcome”  patients  who  received  red  blood 
cell  (RBC)  transfusions  but  lacked  documented  injuries  that 
were  indisputably  hemorrhagic  (see  below).  These  cases 
were  reincluded  in  sensitivity  analyses  (see  below).  Also 
excluded  were  the  few  patients  who  died  before  any 
diagnostic  imaging  or  surgical  exploration,  when  it  could 
not be determined whether the patient died of major
hemorrhage  vs  other  critical  pathology. 


2.4.  Primary  outcome 

Major  hemorrhagic  injury  was  defined  as  a  documented 
injury  that  unequivocally  causes  some  loss  of  blood  volume 
(laceration  or  fracture  of  a  solid  organ,  thoracic  or  abdominal 
hematoma,  vascular  injury  that  required  operative  repair,  or 
limb  amputation)  and  RBC  transfusion  within  24  hours. 

2.5.  Comparison  of  reliable  vs  unreliable  vital  signs 

We  computed  the  patients’  proportions  of  reliable  vital 
signs  (median  and  interquartile  range).  For  the  3  subgroups 
with  at  least  1  reliable  and  1  unreliable  vital  sign — HR,  RR, 
and  BP — we  computed  each  patient’s  mean  of  the  reliable 
and  of  the  unreliable  data  and  compared  the  population  mean 
of  the  subjects’  means  with  Student  t  test  for  paired  data 
(note  that  the  t  test  is  valid  for  normal  and  nonnormal 
distributions  as  long  as  there  are  enough  subjects  per 
distribution,  eg,  30  or  more  [31]). 

To  compare  diagnostic  performance,  we  repeated  the 
following  statistical  computation  100  times  for  each  vital 
sign:  from  each  patient,  we  randomly  selected  1  reliable  and 
1  unreliable  measurement,  then  computed  receiver  operating 
characteristic  (ROC)  curves  for  the  selected  reliable  and  the 
unreliable  data  using  the  method  of  DeLong  et  al  [32].  We 
computed  the  difference  between  the  areas  under  those 
curves  (ROC  AUCs)  and  averaged  the  results  from  the  100 
cycles.  This  methodology  avoided  biases  due  to  those 
patients  with  a  surplus  of  measurements  and  unequal  ratios 
of  reliable  vs  unreliable  measurements  between  patients. 
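
The resampling procedure above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the study's MATLAB code: it uses the empirical (Mann-Whitney) AUC and does not implement the DeLong variance estimate or significance test. The `Patient` tuple layout, the function names, and the toy data are hypothetical. For a vital sign in which lower values indicate hemorrhage (eg, SBP), the values would need to be negated before calling `roc_auc`.

```python
import random
from typing import List, Sequence, Tuple

# (label, reliable values, unreliable values); label 1 = major hemorrhagic injury
Patient = Tuple[int, List[float], List[float]]


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Empirical ROC AUC: the fraction of case/control pairs in which the
    case has the higher score, counting ties as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def mean_auc_difference(patients: Sequence[Patient],
                        n_cycles: int = 100, seed: int = 0) -> float:
    """Average, over n_cycles, of AUC(unreliable) - AUC(reliable), drawing one
    random reliable and one random unreliable value per patient per cycle."""
    rng = random.Random(seed)
    labels = [label for label, _, _ in patients]
    diffs = []
    for _ in range(n_cycles):
        reliable = [rng.choice(rel) for _, rel, _ in patients]
        unreliable = [rng.choice(unrel) for _, _, unrel in patients]
        diffs.append(roc_auc(unreliable, labels) - roc_auc(reliable, labels))
    return sum(diffs) / len(diffs)


# Toy data (made up): one reliable and one unreliable value per patient,
# so the random draw has only one possible outcome.
toy = [(1, [120.0], [100.0]), (1, [110.0], [90.0]),
       (0, [80.0], [95.0]), (0, [85.0], [105.0])]
print(mean_auc_difference(toy, n_cycles=5))  # -0.75
```

Drawing exactly one value of each kind per patient in every cycle is what keeps patients with many measurements from dominating the comparison.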

2.6.  Association  between  vital  signs  and  major 
hemorrhagic  injury  within  the  initial  15  minutes 

For  each  vital  sign,  we  computed  the  univariate  ROC 
AUC for (a) the first nonzero value, (b) the first reliable
value, (c) the last reliable value, and (d) the average of all
reliable  values  within  15  minutes. 
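
The four univariate summaries can be illustrated with a short Python sketch. The `Reading` type, the `summarize` function, and the example values are hypothetical. The sketch assumes that ratings of 2 or 3 denote reliable data (Section 2.2) and that all four summaries, including the first nonzero value, are restricted to the initial 15 minutes.

```python
from typing import Dict, List, NamedTuple, Optional


class Reading(NamedTuple):
    minute: float  # minutes since the start of monitoring
    value: float   # vital sign value (0 means no measurement was output)
    rating: int    # reliability rating, 0 to 3


def summarize(readings: List[Reading], window_min: float = 15.0,
              reliable_min: int = 2) -> Dict[str, Optional[float]]:
    """Return the four per-patient summaries; None when no datum qualifies."""
    in_window = sorted((r for r in readings if r.minute <= window_min),
                       key=lambda r: r.minute)
    nonzero = [r.value for r in in_window if r.value != 0]
    reliable = [r.value for r in in_window if r.rating >= reliable_min]
    return {
        "first_nonzero": nonzero[0] if nonzero else None,
        "first_reliable": reliable[0] if reliable else None,
        "last_reliable": reliable[-1] if reliable else None,
        "mean_reliable": sum(reliable) / len(reliable) if reliable else None,
    }


# Made-up SBP readings (mm Hg), not study data.
sbp = [Reading(0, 0, 0), Reading(2, 151, 1), Reading(5, 104, 3),
       Reading(9, 96, 2), Reading(14, 100, 3), Reading(20, 88, 3)]
summary = summarize(sbp)
print(summary["first_nonzero"], summary["first_reliable"],
      summary["last_reliable"], summary["mean_reliable"])
# 151 104 100 100.0
```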

We  performed  multivariate  analysis  using  ensemble 
classification,  a  collection  of  multivariate  regression  models. 
Each  of  the  models  within  the  ensemble  is  a  standard  linear 
regression  model,  and  their  outputs  are  simply  averaged  to 
yield the ensemble classifier output [33]. Ensemble
classification is able to classify subjects with incomplete data, as is
explained  below.  This  property  was  important  because  many 
patients  lacked  reliable  data  for  every  vital  sign. 

Table 1 Population description

| Characteristic | Study population |
|---|---|
| Population size, n | 671 |
| Male/female, n^a | 498/172 |
| Age (y), mean (SD) | 38 (15) |
| Blunt injury, n (%) | 596 (89) |
| Mortality, n (%) | 41 (6) |
| Prehospital intubation, n (%) | 115 (17) |
| Major hemorrhagic injury, n (%) | 78 (12) |
| % Reliable HR for patient, median (IQR) | 62 (4-84) |
| % Reliable RR for patient, median (IQR) | 16 (0-45) |
| % Reliable SBP for patient, median (IQR) | 100 (83-100) |

Patients with at least 1 reliable vital sign datum within 15 minutes after
exclusion of cases who received RBC transfusions but lacked
documented injuries that were indisputably hemorrhagic (see text for
details). IQR indicates interquartile range.

^a Sex unknown for 1 patient in the database.

Each regression model within the ensemble used 1, 2, or 3
of the following parameters: HR, RR, SBP, and SBP - DBP.
The final ensemble was composed of all possible combinations
(14 total regression models). We applied cross-validation,
randomly partitioning 50% of the study population for
classifier training. Each model was trained using the subset
of patients who possessed at least 1 reliable measurement of
each model parameter within the initial 15 minutes, using the
average of all reliable values from the initial 15 minutes. Next,
we tested the ensemble classifier in the remaining 50% of the
patients. For each patient, we only used those regression
models for which the patient had the necessary reliable data
during the initial 15 minutes and used the models’ average
output as the final output. This process was repeated for 100
cycles, each time randomly repartitioning the patients into
training/testing sets. We computed the mean ROC AUC of
those 100 testing cycles.
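
The structure of the ensemble described in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the study's implementation: model fitting and the repeated 50/50 cross-validation are omitted, the coefficients in `toy_models` are invented placeholders, and the names `FEATURES`, `model_feature_sets`, and `ensemble_output` are hypothetical. `PP` stands for SBP - DBP.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

FEATURES = ("HR", "RR", "SBP", "PP")  # PP = SBP - DBP

# A fitted linear model: (intercept, {feature name: coefficient})
Model = Tuple[float, Dict[str, float]]


def model_feature_sets(features: Tuple[str, ...] = FEATURES,
                       max_size: int = 3) -> List[Tuple[str, ...]]:
    """All combinations of 1, 2, or 3 parameters (4 + 6 + 4 = 14 sets)."""
    return [combo
            for size in range(1, max_size + 1)
            for combo in combinations(features, size)]


def ensemble_output(patient: Dict[str, Optional[float]],
                    models: Dict[Tuple[str, ...], Model]) -> Optional[float]:
    """Average the outputs of only those models whose parameters are all
    available (not None) for this patient; None if no model applies."""
    outputs = []
    for feats, (intercept, coefs) in models.items():
        if all(patient.get(f) is not None for f in feats):
            outputs.append(intercept + sum(coefs[f] * patient[f] for f in feats))
    return sum(outputs) / len(outputs) if outputs else None


print(len(model_feature_sets()))  # 14

# Invented coefficients for three of the 14 models, for illustration only.
toy_models = {
    ("HR",): (-0.2, {"HR": 0.004}),
    ("SBP",): (0.9, {"SBP": -0.006}),
    ("HR", "SBP"): (0.4, {"HR": 0.003, "SBP": -0.005}),
}
# This patient lacks reliable RR and pulse pressure data.
patient = {"HR": 120.0, "RR": None, "SBP": 100.0, "PP": None}
print(round(ensemble_output(patient, toy_models), 2))  # 0.28
```

Because each patient is scored only by the models whose inputs are available, a patient with a single reliable vital sign can still receive an ensemble output.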

2.7.  Sensitivity  analyses 

We repeated the ensemble classification using 4 alternative
methodologies: (a) reinclusion of the “ambiguous outcome”
patients, treating them as nonhemorrhage control
cases; (b) redefining “major hemorrhagic injury” as a documented
hemorrhagic injury, as above, plus RBC transfusion
or at least 3 L of crystalloid infusion; (c) redefining “major
hemorrhagic injury” as the receipt of at least 5 U of RBC
regardless of the documented injuries; and (d) using reliable
vital sign data from only the initial 10 minutes.


3.  Results 

The  database  had  788  records  with  at  least  1  nonzero  vital 
sign  datum.  One  hundred  seventeen  cases  were  excluded 
(105 were “ambiguous outcome” cases subsequently reintroduced
in the sensitivity analysis described below, whereas
12 lacked any reliable vital sign data). Table 1 shows the
population characteristics, with 12% having major hemorrhagic
injury, 17% with prehospital intubation, and 6%
overall  mortality.  Respiratory  rate  data  had  the  lowest  rate  of 
reliability,  whereas  BP  data  had  the  highest. 

Table  2  shows  reliable  data  compared  with  unreliable 
data.  Unreliable  measurements  of  HR,  RR,  and  SBP  all  had 
significantly  elevated  values  vs  their  reliable  counterparts 
and  tended  to  have  reduced  ROC  AUCs. 

Table 3 reports the cumulative diagnostic yields of
the investigative techniques. The ROC AUCs were significantly
improved for initial HR and initial RR when reliability
was considered. The ROC AUCs were significantly
improved for SBP when the average of all its reliable values
was used, whereas these were nonsignificantly increased
for the average of reliable HR or RR. (In regard to the
effects of mechanical ventilation on RR, the average of all
reliable RR yielded an ROC AUC of 0.72 [95% confidence
interval {CI}, 0.62-0.80] for spontaneously breathing
patients and 0.64 [95% CI, 0.46-0.78] for mechanically
ventilated patients.)

Applied to all 671 patients in the study population, the
ensemble classifier yielded an ROC AUC of 0.84 (95% CI,
0.80-0.89) in cross-validation. This AUC was significantly
greater than any univariate vital sign. The classifier could
identify 36% of major hemorrhagic injury cases with greater
than 60% positive predictive value (PPV) and greater than
85% of hemorrhage cases with 24% PPV (Fig. 1).

Table 2 Reliable compared with unreliable vital signs

| Vital sign | Population with at least 1 reliable and 1 unreliable vital sign, n | Patients with major hemorrhagic injury, n (%) | Patients’ proportion of reliable data (%), median (IQR) | Reliable data, mean (SD) | Unreliable data, mean (SD) | Reliable vs unreliable data, P value (Student t test) | Unreliable vital signs: ΔROC AUC for Dx of major hemorrhagic injury, mean (upper/lower range) |
|---|---|---|---|---|---|---|---|
| HR | 632 | 72 (11) | 65 (7-85) | 95 (20) | 99 (20) | <.001^a | -0.02 (-0.05/+0.01) |
| RR | 388 | 52 (13) | 39 (20-61) | 27 (7) | 37 (17) | <.001^a | -0.11 (-0.18/-0.03) |
| SBP | 217 | 34 (16) | 75 (67-86) | 127 (22) | 138 (37) | <.001^b | -0.12 (-0.21/-0.03) |
| DBP | 221 | 34 (15) | 75 (67-86) | 72 (15) | 76 (75) | NS | -0.02 (-0.09/+0.04) |

Populations included only those patients determined to have at least 1 reliable and 1 unreliable vital sign measurement, according to the reliability algorithms,
at any time during their transport. Shown are the patients’ means of reliable vs unreliable data for all patients (computing first the mean of each patient and
then computing the mean of the patients’ means). Student t test for paired data was used to test for significant differences between patients’ means. Finally,
the change in ROC AUC in the diagnosis of major hemorrhagic injury is shown, when one random unreliable measurement was used in place of a random
reliable measurement (see text for details of this calculation). NS indicates not significant. Dx indicates diagnosis.

^a Reliable vs unreliable data are also significant (P < .001) in hemorrhage cases alone and in control cases alone.

^b Reliable vs unreliable data are also significant in hemorrhage cases alone (P = .01) and in control cases alone (P < .001).

Table 3 Areas under receiver operating characteristic curves for the diagnosis of major hemorrhagic injury with application of vital sign
reliability criteria, time averaging, and multivariate (ensemble) classification

| Vital sign | Population | ROC AUC (95% CI), first nonzero | ROC AUC (95% CI), first reliable | ROC AUC (95% CI), last reliable | ROC AUC (95% CI), all reliable |
|---|---|---|---|---|---|
| HR | At least 1 reliable HR (n = 625) | 0.60 (0.53-0.68) | 0.71 (0.63-0.77)^a | 0.72 (0.65-0.78)^a | 0.73 (0.65-0.79)^a |
| RR | At least 1 reliable RR and intubated (n = 85) | 0.52 (0.46-0.58) | 0.64 (0.55-0.72)^a | 0.63 (0.53-0.71) | 0.67 (0.58-0.75)^a |
| RR | At least 1 reliable RR and spontaneous breathing (n = 313) | 0.55 (0.48-0.61) | 0.64 (0.53-0.74) | 0.68 (0.56-0.77)^a | 0.72 (0.62-0.80)^a |
| SBP | At least 1 reliable SBP (n = 648) | 0.71 (0.64-0.78) | 0.74 (0.67-0.80) | 0.77 (0.70-0.83) | 0.79 (0.73-0.85)^a,b |
| DBP | At least 1 reliable DBP (n = 648) | 0.62 (0.54-0.69) | 0.64 (0.56-0.71) | 0.64 (0.56-0.71) | 0.63 (0.55-0.71) |
| Ensemble classifier | At least 1 reliable HR or reliable RR or reliable SBP (n = 671) | NA | NA | NA | 0.84 (0.80-0.89)^c |

Ensemble classification was applied to the overall study population. For RR, results are also provided separately for intubated patients and for spontaneously
breathing patients. The method of DeLong [32] for paired data was used to test for statistically significant differences of ROC AUCs. NA indicates
not applicable.

^a ROC AUC significantly (P < .05) increased vs ROC AUC for “first nonzero” value.

^b ROC AUC significantly (P < .05) increased vs ROC AUC for “first reliable” data.

^c Ensemble ROC AUC significantly increased vs ROC AUC for “all reliable” HR data (P < .001), “all reliable” RR data (P < .001), “all reliable” SBP
data (P < .05), and “all reliable” DBP data (P < .001).

The sensitivity analyses yielded the following ROC
AUCs for major hemorrhagic injury, which were similar
to the primary analysis: (a) inclusion of the ambiguous
outcome patients, 0.82 (95% CI, 0.77-0.87); (b) use of RBC
transfusion or at least 3 L of crystalloid infusion as the
outcome, 0.83 (95% CI, 0.79-0.87); (c) inclusion of
ambiguous outcome patients and changing the outcome
to the receipt of at least 5 U of RBC, 0.81 (95% CI, 0.76-
0.86); and (d) use of only the initial 10 minutes, 0.81 (95%
CI, 0.76-0.86).

[Figure 1 image not reproduced; recoverable axis labels: heart rate (bpm), systolic (mmHg), classifier output (0.0 to 1.0).]

Fig. 1 Histograms of basic vital signs and of the multivariate ensemble classifier for major hemorrhagic injury cases vs control cases.
Histograms for each basic vital sign (HR, RR, SBP, and DBP) using the first nonzero value and the output of the multivariate ensemble
classifier (using cross-validation with distinct training/testing data; see text for details). Patient populations for each histogram correspond to
the populations in Table 3, whereas multivariate ensemble classification was applied to the entire study population. Right: Ensemble output
(testing data) averaged from 100 iterations of cross-validation. Using one cutoff, ensemble classification yielded a sensitivity of greater than
85% and a PPV of 24%; patients below this threshold lie in the green background field. Using an alternative cutoff, ensemble classification
offered a sensitivity of 36% and a PPV of greater than 60%; patients above this threshold lie in the red background field. Green zone: 383
control cases and 11 major hemorrhagic injury cases; yellow zone: 192 control cases and 39 major hemorrhagic injury cases; red zone: 18
control cases and 28 major hemorrhagic injury cases.
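
As an arithmetic check, the sensitivity, PPV, and negative predictive value quoted for the two classifier cutoffs can be recomputed from the zone counts given in the Fig. 1 caption. The short Python sketch below does this; the variable and function names are illustrative only.

```python
# Zone counts from the Fig. 1 caption: (control cases, major hemorrhagic injury cases)
zones = {"green": (383, 11), "yellow": (192, 39), "red": (18, 28)}

total_cases = sum(cases for _, cases in zones.values())  # 78


def sensitivity_and_ppv(positive_zones):
    """Treat patients in the listed zones as test positive."""
    true_pos = sum(zones[z][1] for z in positive_zones)
    false_pos = sum(zones[z][0] for z in positive_zones)
    return true_pos / total_cases, true_pos / (true_pos + false_pos)


# Lower cutoff: yellow and red zones count as positive.
sens_low, ppv_low = sensitivity_and_ppv(["yellow", "red"])
# Higher cutoff: only the red zone counts as positive.
sens_high, ppv_high = sensitivity_and_ppv(["red"])
# Negative predictive value of a green-zone result.
green_controls, green_cases = zones["green"]
npv_green = green_controls / (green_controls + green_cases)

print(f"{sens_low:.1%} {ppv_low:.1%}")    # 85.9% 24.2%
print(f"{sens_high:.1%} {ppv_high:.1%}")  # 35.9% 60.9%
print(f"{npv_green:.1%}")                 # 97.2%
```

These values agree with the reported figures: greater than 85% sensitivity at 24% PPV, 36% sensitivity at greater than 60% PPV, and a 97% negative predictive value for the green zone.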


4.  Discussion 

We found that accounting for measurement error and
physiologic variability can significantly improve the association
between vital signs and major hemorrhagic injury. Vital
signs may be more informative about a trauma patient’s
circulatory state than previously appreciated in reports that
did not explicitly consider these factors [26,34-36]. Moreover,
these findings may inform the design and interpretation
of a range of clinical trials that involve vital signs and how
vital signs are used in clinical practice. The implications are
cautionary,  suggesting  that  such  factors  are  important  to 
consider.  At  the  same  time,  these  findings  also  suggest 
potential  solutions. 

The  computational  techniques  used  in  this  analysis  have 
been  previously  described  [28-30,33,37,38].  Here,  the 
techniques  were  integrated  to  determine  their  cumulative 
effects in a population of trauma patients. These techniques
significantly improved the association of vital signs and
major  hemorrhagic  injury  without  the  need  for  consideration 
of  the  patients’  baseline  vital  signs,  administration  of 
medications,  anatomical  location  of  the  injury,  age,  or 
mechanism  of  injury.  Applied  cumulatively,  diagnostic 
performance  exceeded  prior  reports  on  the  individual 
techniques  [33].  The  vital  sign  patterns  correctly  classified 
by  these  techniques  were  not  always  self-evident  by  eye 
(eg,  Fig.  2). 

4.1.  Clinical  implications 

We  have  shown  that  reliable  vital  sign  data  have  a 
significantly  higher  association  with  a  life-threatening 
pathophysiology,  even  as  unreliable  measurements  were 
commonplace  (Table  1).  These  findings  support  the 
adherence  to  proper  vital  sign  measurement  techniques; 
even  better  than  excluding  unreliable  data,  as  was  done  in 
this  retrospective  study,  would  be  reducing  unreliable 
measurements in the first place. When procuring monitoring
apparatus, it would be desirable to prioritize makes and
models that possess maximum accuracy [11-13]. In addition,
the study implies a potential benefit to continuing training of
clinical staff to enhance the diagnostic value of vital sign
measurement. Sources of unreliable vital sign data include
poor electrode placement (eg, chest hair causing poor skin
adhesion), excessive patient movement, and poor placement
of BP cuffs. It would be truly revealing to study,
prospectively, which sources of error are the most problematic
and whether the association between vital signs and
pathology can be enhanced through focused training.

Fig. 2 Case examples for which the presence or absence of major hemorrhagic injuries can be identified by patterns in the vital signs. Both
cases had HRs of more than 120 beats per minute and normotension. In patient 1, the ensemble multivariate classifier, which weighs the HR,
RR, and systolic and pulse pressures, indicated that major hemorrhagic injury was unlikely (ie, the classifier output lay in the low-risk green
zone, with a 97% negative predictive value; see Fig. 1). Patient 1 did not require RBC transfusion and was diagnosed with a cerebral contusion
and a neck injury without major vascular involvement. In patient 2, the ensemble multivariate classifier indicated that major hemorrhagic
injury was probable (ie, the classifier output lay in the high-risk red zone, with a 60% PPV; see Fig. 1). Patient 2 had a grade III liver laceration;
a fractured, disrupted pelvis; and a femur fracture and received 3 U of RBC. Thin lines and open triangles indicate unreliable data according to
the automated algorithms; thick lines and solid triangles indicate reliable data.

Certain  techniques  suggested  by  this  report  might  be 
applied  at  the  bedside  to  assess  the  state  of  the  casualty. 
For  example,  when  patients  arrive  at  the  hospital,  clinicians 
expecting  obvious  vital  sign  trends  might  be  misled  because 
we  have  found  that  transient  perturbations  may  mask  the 
underlying  trends  and  that  measurements  made  at  the  end 
of  transport  are  not  necessarily  more  useful  than  the 
preceding  prehospital  measurements.  Shapiro  et  al  [39]  and 
Lipsky  et  al  [40]  reported  that,  among  patients  who  arrived 
normotensive  in  the  emergency  department,  one  or  more 
episodes  of  preceding  hypotension  were  associated  with 
higher  acuity.  Our  findings  suggest  that,  in  addition  to  the 
most  recent  measurements,  clinicians  should  consider  the 
time  average  of  recent  data,  which  we  have  shown  can  be 
significantly  more  diagnostic. 

This  raises  the  question  of  what  duration  of  time  window 
is  optimal  for  computation  of  average  values  of  recent  vital 
sign  data,  for  example,  5,  15,  or  60  minutes.  The  goal  of  the 
time  averaging  is  to  filter  out  transient  perturbations;  but  if 
the  time  window  gets  too  large,  then  time  averaging  can 
actually obscure trends developing in later data. Therefore,
the time window should not be too large.
We  speculate  that  averaging  over  more  than  15  minutes  may 
not  be  diagnostically  optimal,  but  this  is  difficult  to  answer 
definitively  with  the  current  data  set  because  the  records  are 
of  such  heterogeneous  duration. 

Simultaneous  consideration  of  multiple  vital  signs  can 
also  improve  the  value  of  the  data.  For  instance,  low  BP 
could  represent  significant  blood  loss,  the  patient’s  normal 
baseline,  or  reduced  adrenergic  tone.  Tachycardia  and 
tachypnea suggest the first, normal rate and respiration
suggest  baseline  physiology,  and  bradycardia  and  bradypnea 
suggest  sympatholysis.  Clinicians  may  be  unable  to  mentally 
compute  a  multivariate  statistical  model;  but  a  simple 
multivariate  metric,  such  as  the  shock  index  (the  ratio  of 
HR  and  SBP  [41,42]),  can  be  applied  at  the  bedside. 
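
As a small illustration, the shock index can be computed as follows. The function name, units, and example values are assumptions made for this sketch, and no diagnostic threshold is implied.

```python
def shock_index(hr_bpm: float, sbp_mmhg: float) -> float:
    """Shock index: heart rate (beats/min) divided by systolic BP (mm Hg)."""
    if sbp_mmhg <= 0:
        raise ValueError("systolic BP must be positive")
    return hr_bpm / sbp_mmhg


# Hypothetical readings: the same low SBP yields a different index depending
# on the accompanying heart rate.
si_tachycardic = shock_index(hr_bpm=120.0, sbp_mmhg=80.0)
si_normal_rate = shock_index(hr_bpm=60.0, sbp_mmhg=80.0)
```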

4.2.  Research  implications 

We demonstrated that accounting for sources of measurement
variability can yield significantly different results
when  analyzing  vital  sign  data.  Accordingly,  we  recommend 
the  following  steps  for  clinical  research  involving  vital 
sign  data:  (a)  report  the  make  and  model  of  any  monitoring 
equipment  used  and,  when  available,  provide  accuracy 
citations [12,13,43]; (b) report relevant in-service training, or
its  absence,  of  the  clinical  staff;  (c)  keep  the  measurement 
environment  as  consistent  as  possible  to  reduce  transient 
variability, or else use the average of several measurements;
and (d) consider the use of validated clinical scores or
propensity  scores  to  supplement  or  replace  individual 
vital  signs. 

In  addition,  we  note  that  there  has  been  academic  interest 
in  novel  types  of  physiologic  sensors  intended  to  improve 
patient  monitoring.  The  cost  and  effort  necessary  to  adopt 
new  sensor  modalities  might  be  weighed  against  the  findings 
in  this  report,  which  are  that  standard  vital  signs  can  be 
significantly  improved  through  application  of  some  simple 
techniques.  Academically,  we  suggest  that  new  monitoring 
modalities  should  be  directly  compared  against  conventional 
monitoring,  with  consideration  given  to  the  sources  of 
variability  highlighted  here. 

4.3.  Specific  findings 

Systolic  BP  was  the  best  univariate  predictor.  We  [37] 
and  others  [44,45]  have  previously  found  that  prehospital 
trauma  patients  demonstrate  substantial  temporal  variability. 
We  reduced  the  effects  of  transient  perturbations  by  using  the 
time  average  of  serial  vital  sign  measurements,  which 
yielded  significantly  higher  ROC  AUCs  for  SBP,  higher 
than  either  the  initial  or  final  prehospital  SBP.  Diastolic  BP 
alone  was  a  weak  predictor;  but  we  found  that  it  provides 
additional  information  independent  of  SBP  because  it  is 
useful  to  compute  pulse  pressure,  the  difference  between 
SBP  and  DBP  [33].  In  spontaneously  breathing  patients, 
reliable  RR  was  a  useful  predictor  of  hemorrhage.  This 
finding  was  anticipated  by  classic  physiologic  reports  that 
demonstrated that blood flow to the carotid body chemoreceptors
is reduced in early hemorrhage because of compensatory
vasoconstriction. “Stagnant” hypoxia then develops in
the  chemoreceptors,  triggering  an  increased  respiratory  drive 
and  tachypnea  [46-49].  Interestingly,  this  RR  reliability 
algorithm  was  not  originally  developed  to  diagnose  major 
hemorrhagic  injury  per  se,  but  to  identify  intervals  in  the  IP 
that  matched  clinicians’  opinions  that  the  respiratory 
waveform  was  rhythmic  and  consistent  [28].  Used  as  a 
diagnostic  tool,  we  found  that  reliable  RR  data  were 
significantly  more  diagnostic  than  unreliable  RR.  We 
observed  that  unreliable  RR  was  often  falsely  elevated  (ie, 
biased)  because  of  motion  artifacts  in  the  pneumogram  that 
were  incorrectly  counted  as  additional  breaths. 

Only  a  subset  of  patients  (59%)  had  a  complete  set  of 
reliable  vital  signs  within  15  minutes.  This  was  consistent 
with  prior  reports  that  unreliable  vital  sign  data  are  all  too 
typical in clinical practice [1,2,4-6]. To deal with missing
data,  we  used  an  ensemble  classifier  for  multivariate 
classification,  which  was  significantly  better  than  univariate 
classification.  In  a  prior  report,  the  ensemble  classifier  was 
applied  to  a  moving  2-minute  window  of  vital  sign  data  [33]. 
That  approach  was  not  as  successful  because,  in  any  given 
2-minute  window,  there  was  an  exaggerated  proportion  of 
missing data and there was major minute-to-minute variability
that, here, we successfully filtered out by time
averaging  over  15  minutes  (see  above).  In  addition,  the 
current ensemble uses pulse pressure instead of DBP and
does  not  incorporate  oxygen  saturation,  thus  excluding  weak 
univariate  predictors. 
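
One way an ensemble can tolerate missing inputs is sketched below: each member scores a single vital sign, and the ensemble averages only the members whose input is available. This is an illustration under stated assumptions, not the study's classifier. The logistic form, the member coefficients, and all names are hypothetical, and the coefficients are not fitted to any data.

```python
import math
from typing import Dict, Optional

# Hypothetical (intercept, slope) per member; illustrative only, NOT fitted.
MEMBERS = {
    "hr": (-4.0, 0.04),   # higher heart rate raises the score
    "rr": (-3.0, 0.12),   # higher respiratory rate raises the score
    "sbp": (5.0, -0.05),  # lower systolic BP raises the score
    "pp": (2.0, -0.06),   # lower pulse pressure raises the score
}


def pulse_pressure(sbp: Optional[float],
                   dbp: Optional[float]) -> Optional[float]:
    """Pulse pressure (SBP minus DBP), or None if either value is missing."""
    if sbp is None or dbp is None:
        return None
    return sbp - dbp


def member_score(name: str, value: float) -> float:
    """Univariate logistic score in (0, 1) for one vital sign."""
    intercept, slope = MEMBERS[name]
    return 1.0 / (1.0 + math.exp(-(intercept + slope * value)))


def ensemble_score(vitals: Dict[str, Optional[float]]) -> Optional[float]:
    """Average the scores of members whose input is present.

    Returns None when every input is missing.
    """
    scores = [member_score(name, value)
              for name, value in vitals.items()
              if name in MEMBERS and value is not None]
    return sum(scores) / len(scores) if scores else None


# Hypothetical patient with no reliable respiratory rate.
patient = {"hr": 130.0, "rr": None, "sbp": 80.0,
           "pp": pulse_pressure(80.0, 60.0)}
score = ensemble_score(patient)
```

Because the average is taken over whichever members are available, a patient with a single reliable vital sign still receives a score, although one that rests on less information.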

4.4.  Automated  diagnostic  algorithms 

It  is  technically  feasible  to  run  this  investigation’s 
analysis  algorithms  in  real  time,  automatically  distinguishing 
between  normovolemic  vs  hemorrhagic  vital  sign  patterns. 
We  speculate  that  such  automated,  continuous  analysis 
could  improve  the  quality  and  safety  of  any  monitored 
patient,  especially  when  the  clinical  staff  is  distracted  or 
inexperienced. In addition, protocols for triage or resuscitation
could be considered using the algorithm’s output as a
starting  point  that  may  be  more  clinically  valid  than  any  sole 
vital  sign.  Lastly,  in  some  cases,  the  algorithm  could 
enhance  the  judgment  of  the  clinician  (eg,  cases  such  as  in 
Fig.  2).  Similar  types  of  automated  analysis  of  vital  sign  data 
may  likewise  prove  useful  for  other  clinical  applications, 
such  as  early  detection  of  acute  deterioration  of  hospital 
ward  patients  [50]. 
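
As a rough sketch of how time averaging could run continuously, the following class keeps a rolling mean of the reliable measurements in the most recent window and updates it as each new sample arrives. The class name, the 15-minute default, and the assumption that timestamps arrive in nondecreasing order are choices made for this illustration, not a description of the study's implementation.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingMean:
    """Rolling mean of reliable samples within the most recent window."""

    def __init__(self, window_min: float = 15.0) -> None:
        self.window_min = window_min
        self._buf: Deque[Tuple[float, float]] = deque()
        self._total = 0.0

    def update(self, t_min: float, value: float,
               reliable: bool = True) -> Optional[float]:
        """Add one sample (timestamps assumed nondecreasing).

        Returns the current windowed mean, or None if the window holds
        no reliable sample.
        """
        if reliable:
            self._buf.append((t_min, value))
            self._total += value
        # Evict samples that have fallen out of the window ending at t_min.
        while self._buf and self._buf[0][0] <= t_min - self.window_min:
            _, old_value = self._buf.popleft()
            self._total -= old_value
        return self._total / len(self._buf) if self._buf else None


# Hypothetical heart-rate stream with a 5-minute window.
monitor = RollingMean(window_min=5.0)
first = monitor.update(0.0, 100.0)
second = monitor.update(2.0, 110.0)
skipped = monitor.update(4.0, 0.0, reliable=False)  # ignored as unreliable
third = monitor.update(6.0, 90.0)                   # minute-0 sample evicted
```

The output of such a rolling average could then be passed to a multivariate classifier at each update.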

4.5.  Limitations 

There  are  several  factors  to  consider  in  terms  of  the 
internal  validity  of  this  study.  First,  there  is  no  gold  standard 
definition  to  retrospectively  distinguish  true  hemorrhagic 
injury vs minor (or non) hemorrhagic injuries. We therefore
analyzed  several  alternative  outcome  definitions.  The  similar 
results,  regardless  of  the  specific  definition,  suggest  that  the 
findings  were  not  an  artifact  of  the  outcome  definition  but 
will be similar given any reasonable definition of hemorrhagic
injury (note that our database did not contain
parameters  such  as  base  deficit  and  pH).  Our  findings 
would  be  further  strengthened  if  future  investigations 
demonstrate  comparable  findings  given  additional  end  points 
and  pathologic  processes. 

As  a  second  limitation,  the  present  findings  depended  on 
our  algorithms  to  identify  reliable  vital  signs;  and  the  results 
might  be  different  with  different  algorithms.  However,  in 
developing  these  algorithms,  we  found  that  most  analytic 
methodologies  that  we  explored  yielded  similar  results 
because,  in  practice,  the  different  algorithms  only  differed 
about  borderline  cases,  a  minority  of  the  data  set  [51].  In 
most  of  the  cases,  which  were  clearly  reliable  (eg,  HR  based 
on  very  clean  ECG)  or  clearly  unreliable  (eg,  HR  based  on 
very  noisy  ECG),  different  versions  of  the  algorithms  that  we 
explored  yielded  consistent  ratings  of  vital  sign  reliability. 
(Note  that  these  reliability  algorithms  were  not  a  priori 
developed  to  diagnose  major  hemorrhage  but  to  match 
clinicians’  opinions  regarding  whether  waveform  segments 
were  clean  with  well-defined  heartbeats  [27]  or  breaths  [28].) 

Third,  the  data  set  was  notable  in  that  many  patients  were 
missing  a  full  set  of  reliable  data.  However,  we  contend  that 
this is a salient finding of the study, rather than a limitation,
because it emphasizes the prevalence of unreliable vital sign
data.  At  the  same  time,  it  did  not  hamper  the  univariate 
analyses  because  there  were  suitably  large  populations  for 
each  analysis.  Finally,  for  the  multivariate  analysis,  we  were 
able  to  report  a  valid  ROC  AUC  for  the  broadest  study 
population  (any  patient  with  at  least  1  reliable  vital  sign 
within the first 15 minutes) by using an ensemble classifier,
which  can  tolerate  missing  data.  The  performance  of  the 
ensemble  classifier  was  assessed  through  cross-validation, 
that  is,  with  distinct  training  and  testing  patient  populations. 

In  terms  of  the  external  validity  of  the  study,  the  issues 
that we studied have been previously recognized [1,2,4-6].
This  report  offers  a  novel,  quantitative  analysis  of  their 
magnitude  of  effect  in  actual  prehospital  practice.  It  is  not 
certain  to  what  extent  the  quantitative  results  of  this  analysis 
will  apply  to  different  clinical  settings,  for  example, 
emergency  department  vs  hospital  ward  vs  ground  EMS, 
and  different  make  and  model  of  patient  monitors.  Likewise, 
there may be salient differences given alternative populations,
for example, older patients with a higher rate of
β-blocker medication. However, the study population of this
report  was  reasonably  large  (>600  subjects);  and  such 
considerations  were  outside  its  scope.  This  analysis  provides 
a  prima  facie  demonstration  that  each  of  the  factors  is 
important  and  that  specific  strategies  can  significantly  alter 
diagnostic  test  characteristics  of  routine  clinical  data.  Further 
work  is  warranted  to  explore  these  factors  in  a  diversity  of 
clinical  arenas  and  populations. 


5.  Conclusion 

The  study  is  notable  for  quantifying  the  magnitude  of  the 
effect  of  physiologic  variability  and  measurement  error  on  a 
diagnostic  application  of  vital  signs.  These  sources  of 
variability  were  commonplace  in  this  clinical  data  analysis. 
Techniques  that  accounted  for  the  variability  yielded 
significantly  improved  diagnostic  test  characteristics.  Vital 
sign  data  are  often  treated  uncritically  in  published  reports. 
The  findings  here  suggest  that  these  factors  should  be 
carefully  considered  when  using  vital  signs  in  clinical 
practice  or  research  protocols. 